Likelihood / Regression

The other side of the coin

Statistics: Interested in estimating population-level characteristics; i.e., the parameters

\[ y \rightarrow f(y|\boldsymbol{\theta}) \]

Estimation

  • likelihood
  • Bayesian

Likelihood

Likelihood principle

Given a statistical model, all the evidence/information in a sample (\(\textbf{y}\), i.e., data) relevant to model parameters (\(\theta\)) is contained in the likelihood function.

. . .

Fisher Information

The information that an observable random variable (\(\textbf{y}\)) carries about an unknown parameter \(\theta\) upon which the probability of \(\textbf{y}\), \(f(\textbf{y};\theta)\), depends.
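Formally, a standard way to write this (not spelled out above) is as the expected curvature of the log-likelihood:

\[ I(\theta) = -E\left[ \frac{\partial^2}{\partial \theta^2} \log f(\textbf{y};\theta) \right] \]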


. . .


Information is conditional

. . .

To learn about \(\theta\) from \(\textbf{y}\), we need to link them together via a special function, \(f(\textbf{y};\theta)\)

Statistical Information


The pieces:

::: incremental

  • The sample data, \(\textbf{y}\)

  • A probability function for \(\textbf{y}\):

    • \(f(\textbf{y};\theta)\)

    • \([\textbf{y}|\theta]\)

  • The unknown parameter: \(\theta\)

    • specified in the probability function

:::

The Likelihood Function

What can we say about our parameters using this function?

\[ \begin{align*} \mathcal{L}(\boldsymbol{\theta}|y) = P(y|\boldsymbol{\theta}) = f(y|\boldsymbol{\theta}) \end{align*} \]

. . .

The likelihood (\(\mathcal{L}\)) of the unknown parameters, given our data, can be calculated using our probability function.

. . .

# A data point
  y=c(10)

# The likelihood that the mean is 8, given our data (sd defaults to 1)
  dnorm(y,mean=8)
[1] 0.05399097


. . .

If we knew the mean were truly 8, this value would also be the probability density of the observation \(y = 10\).
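As a sanity check, `dnorm()` is just evaluating the Normal probability density function; we can reproduce the value by hand (assuming the default sd = 1):

```r
# Evaluate the Normal PDF at y = 10 with mu = 8, sigma = 1, by hand
y <- 10
mu <- 8
sigma <- 1
(1 / (sigma * sqrt(2 * pi))) * exp(-0.5 * ((y - mu) / sigma)^2)
# [1] 0.05399097 -- matches dnorm(y, mean = 8)
```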

Many Parameter Guesses

# Let's take many guesses of the mean
  means=seq(0,20,by=0.1)

# Use dnorm to get likelihood of each guess of the mean
# Assumes sd = 1
  likelihood=dnorm(y, mean=means)
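The guess with the highest likelihood should sit at the observed data point. A quick way to pull it out (a sketch, restating the setup above so it runs on its own):

```r
# A data point and many guesses of the mean (sd assumed = 1)
y <- c(10)
means <- seq(0, 20, by = 0.1)
likelihood <- dnorm(y, mean = means)

# The guess of the mean with the maximum likelihood
means[which.max(likelihood)]
# [1] 10
```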

. . .

Maximum Likelihood Properties

  • MLEs are consistent. As sample size increases, they will converge to the true parameter value.

  • MLEs are asymptotically unbiased. The \(E[\hat{\theta}]\) converges to \(\theta\) as the sample size gets larger.

  • There is no guarantee that an MLE is unbiased at small sample sizes. This can be tested!

  • MLEs are asymptotically efficient. As the sample size gets larger, their variance approaches the minimum possible among all estimators (the Cramér-Rao lower bound).
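Consistency is easy to see by simulation. A minimal sketch (the Normal model, true mean, and sample sizes here are assumptions for illustration):

```r
# The MLE of a Normal mean is the sample mean; watch it converge
# toward the true value as n grows
set.seed(1)
true_mu <- 5
for (n in c(10, 100, 10000)) {
  y <- rnorm(n, mean = true_mu, sd = 1)
  cat("n =", n, " MLE of mu =", mean(y), "\n")
}
```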

Statistics and PDF Example

What is the mean height of King Penguins?

Statistics and PDF Example

We go and collect data,

\(\boldsymbol{y} = \begin{bmatrix} 4.34 & 3.53 & 3.75 \end{bmatrix}\)


. . .

Let’s decide to use the Normal Distribution as our PDF.

. . .

\[ \begin{align*} f(y_1 = 4.34|\mu,\sigma) &= \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}(\frac{y_{1}-\mu}{\sigma})^2} \\ \end{align*} \]

. . .

AND

\[ \begin{align*} f(y_2 = 3.53|\mu,\sigma) &= \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}(\frac{y_{2}-\mu}{\sigma})^2} \\ \end{align*} \]

. . .

AND

\[ \begin{align*} f(y_3 = 3.75|\mu,\sigma) &= \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}(\frac{y_{3}-\mu}{\sigma})^2} \\ \end{align*} \]

. . .

Or simply,

\[ \textbf{y} \stackrel{iid}{\sim} \text{Normal}(\mu, \sigma) \]

. . .

\(iid\) = independent and identically distributed

. . .

Continued

The joint probability of our data with shared parameters \(\mu\) and \(\sigma\),

\[ \begin{align*} & P(Y_{1} = y_1,Y_{2} = y_2, Y_{3} = y_3 | \mu, \sigma) \\ &= \mathcal{L}(\mu, \sigma|\textbf{y}) \end{align*} \]

. . .

If each \(y_{i}\) is independent, the joint probability of our data is simply the product of the three probability densities,

\[ \begin{align*} =& f(y_{1}|\mu, \sigma)\times f(y_{2}|\mu, \sigma)\times f(y_{3}|\mu, \sigma) \end{align*} \]

We can do this because we are assuming that knowing one value (\(y_1\)) gives us no new information about another value (\(y_2\)).

. . .

\[ \begin{align*} =& \prod_{i=1}^{3} f(y_{i}|\mu, \sigma) \\ =& \mathcal{L}(\mu, \sigma|y_{1},y_{2},y_{3}) \end{align*} \]

Code

Translate the math to code…

# penguin height data
  y=c(4.34, 3.53, 3.75)

# Joint likelihood of mu = 3, sigma = 1, given our data
  prod(dnorm(y,mean=3,sd=1))
[1] 0.01696987


. . .

Calculate the likelihood of many guesses of \(\mu\) and \(\sigma\) simultaneously,

# The Guesses
  mu=seq(0,6,0.05)
  sigma=seq(0.01,2,0.05)
  try=expand.grid(mu,sigma)
  colnames(try)=c("mu","sigma")

# function
fun=function(a,b){
  prod(dnorm(y,mean=a,sd=b))
  }

# mapply the function with the inputs
  likelihood=mapply(a=try$mu,b=try$sigma, FUN=fun)

# maximum likelihood of parameters
  try[which.max(likelihood),]
      mu sigma
925 3.85  0.36
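For a Normal model the MLEs also have closed forms: \(\hat{\mu}\) is the sample mean, and \(\hat{\sigma}\) is the standard deviation computed with an \(n\) (not \(n-1\)) denominator. A quick check against the grid search:

```r
# penguin height data
y <- c(4.34, 3.53, 3.75)

# Closed-form maximum likelihood estimates for a Normal model
mu_hat <- mean(y)
sigma_hat <- sqrt(mean((y - mean(y))^2))  # n in the denominator, not n - 1
c(mu_hat, sigma_hat)
# close to the grid-search answer of mu = 3.85, sigma = 0.36
```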


. . .

Likelihood plot (3D)

Sample Size

What happens to the likelihood if we increase the sample size to N=100?

. . .
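One immediate consequence: with N = 100 independent observations, the joint likelihood is a product of 100 density values, and that product typically becomes vanishingly small. A sketch with simulated data (the simulated heights and parameter values here are assumptions, not the original data):

```r
# Simulate N = 100 heights and evaluate the joint likelihood at one guess
set.seed(1)
y <- rnorm(100, mean = 3.87, sd = 0.34)

# A product of 100 densities quickly approaches zero
prod(dnorm(y, mean = 3.87, sd = 0.34))
```

In practice this numerical underflow is why the log-likelihood, `sum(dnorm(y, mean = 3.87, sd = 0.34, log = TRUE))`, is maximized instead of the raw likelihood.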